Unsupervised Learning Project : Trade & Ahead¶

Context¶

The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.

It is important to maintain a diversified portfolio when investing in stocks to maximize earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones that exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.

Objective¶

Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.

Data Description¶

  • Ticker Symbol : An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
  • Company : Name of the company
  • GICS Sector : The specific economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • GICS Sub Industry : The specific sub-industry group assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • Current Price : Current stock price in dollars
  • Price Change : Percentage change in the stock price in 13 weeks
  • Volatility : Standard deviation of the stock price over the past 13 weeks
  • ROE : A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is equal to a company's assets minus its debt)
  • Cash Ratio : The ratio of a company's total reserves of cash and cash equivalents to its total current liabilities
  • Net Cash Flow : The difference between a company's cash inflows and outflows (in dollars)
  • Net Income : Revenues minus expenses, interest, and taxes (in dollars)
  • Earnings Per Share : Company's net profit divided by the number of common shares it has outstanding (in dollars)
  • Estimated Shares Outstanding : The estimated number of shares of the company's stock currently held by all its shareholders
  • P/E Ratio : Ratio of the company's current stock price to the earnings per share
  • P/B Ratio : Ratio of the company's stock price per share to its book value per share (book value of a company is the net difference between that company's total assets and total liabilities)
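To make the ratio definitions above concrete, here is a tiny worked example. All figures are made up for illustration and are not from the dataset:

```python
# Hypothetical figures for a single company (illustration only)
net_income = 5_000_000.0
shareholders_equity = 25_000_000.0
shares_outstanding = 2_000_000.0
stock_price = 50.0
book_value_per_share = 20.0

roe = net_income / shareholders_equity         # return on equity
eps = net_income / shares_outstanding          # earnings per share
pe_ratio = stock_price / eps                   # price-to-earnings
pb_ratio = stock_price / book_value_per_share  # price-to-book

print(roe, eps, pe_ratio, pb_ratio)  # 0.2 2.5 20.0 2.5
```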

Importing necessary libraries and data¶

In [ ]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# For cross-tabulation (or contingency table)
#from pandas import crosstab

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='darkgrid')

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to scale the data using z-score
from scipy.stats import zscore

# to compute distances
from scipy.spatial.distance import cdist, pdist

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Dimensionality reduction
from sklearn.decomposition import PCA

# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# to suppress warnings
import warnings
warnings.filterwarnings("ignore")

Data Overview¶

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
data = pd.read_csv("/content/drive/MyDrive/AIML/Unsupervised_Learning/stock_data.csv")
In [ ]:
df = data.copy()

First and last 5 rows of the dataset

In [ ]:
# looking at head (first 5 observations)
df.head()
Out[ ]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 AAL American Airlines Group Industrials Airlines 42.349998 9.999995 1.687151 135 51 -604000000 7610000000 11.39 6.681299e+08 3.718174 -8.784219
1 ABBV AbbVie Health Care Pharmaceuticals 59.240002 8.339433 2.197887 130 77 51000000 5144000000 3.15 1.633016e+09 18.806350 -8.750068
2 ABT Abbott Laboratories Health Care Health Care Equipment 44.910000 11.301121 1.273646 21 67 938000000 4423000000 2.94 1.504422e+09 15.275510 -0.394171
3 ADBE Adobe Systems Inc Information Technology Application Software 93.940002 13.977195 1.357679 9 180 -240840000 629551000 1.26 4.996437e+08 74.555557 4.199651
4 ADI Analog Devices, Inc. Information Technology Semiconductors 55.320000 -1.827858 1.701169 14 272 315120000 696878000 0.31 2.247994e+09 178.451613 1.059810
In [ ]:
# looking at tail (last 5 observations)
df.tail()
Out[ ]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
335 YHOO Yahoo Inc. Information Technology Internet Software & Services 33.259998 14.887727 1.845149 15 459 -1032187000 -4359082000 -4.64 939457327.6 28.976191 6.261775
336 YUM Yum! Brands Inc Consumer Discretionary Restaurants 52.516175 -8.698917 1.478877 142 27 159000000 1293000000 2.97 435353535.4 17.682214 -3.838260
337 ZBH Zimmer Biomet Holdings Health Care Health Care Equipment 102.589996 9.347683 1.404206 1 100 376000000 147000000 0.78 188461538.5 131.525636 -23.884449
338 ZION Zions Bancorp Financials Regional Banks 27.299999 -1.158588 1.468176 4 99 -43623000 309471000 1.20 257892500.0 22.749999 -0.063096
339 ZTS Zoetis Health Care Pharmaceuticals 47.919998 16.678836 1.610285 32 65 272000000 339000000 0.68 498529411.8 70.470585 1.723068
In [ ]:
#Checking the shape of the dataset
df.shape
Out[ ]:
(340, 15)
  • There are 340 rows and 15 columns present in the dataset.
In [ ]:
#Checking the data types of the columns for the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticker Symbol                 340 non-null    object 
 1   Security                      340 non-null    object 
 2   GICS Sector                   340 non-null    object 
 3   GICS Sub Industry             340 non-null    object 
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64  
 8   Cash Ratio                    340 non-null    int64  
 9   Net Cash Flow                 340 non-null    int64  
 10  Net Income                    340 non-null    int64  
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
  • There are 3 data types present in the dataset: object, float64, and int64
  • Going forward, we will convert the object dtype columns to category
In [ ]:
df.duplicated().sum()
Out[ ]:
0
In [ ]:
df.isnull().sum()
Out[ ]:
0
Ticker Symbol 0
Security 0
GICS Sector 0
GICS Sub Industry 0
Current Price 0
Price Change 0
Volatility 0
ROE 0
Cash Ratio 0
Net Cash Flow 0
Net Income 0
Earnings Per Share 0
Estimated Shares Outstanding 0
P/E Ratio 0
P/B Ratio 0

  • The dataset has no missing or duplicate values
  • All columns with dtype object should be converted to dtype category in order to conserve memory
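The memory saving from the object-to-category conversion noted above can be illustrated on a small synthetic column (the values below are made up; any low-cardinality string column behaves similarly):

```python
import pandas as pd

# Synthetic low-cardinality string column, similar in spirit to 'GICS Sector'
s_obj = pd.Series(["Industrials", "Financials", "Health Care"] * 1000)
s_cat = s_obj.astype("category")

obj_bytes = s_obj.memory_usage(deep=True)  # per-string Python objects
cat_bytes = s_cat.memory_usage(deep=True)  # small codes + one copy of each label
print(obj_bytes, cat_bytes)  # category uses far less memory
```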
In [ ]:
df.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
Current Price 340.0 8.086234e+01 9.805509e+01 4.500000e+00 3.855500e+01 5.970500e+01 9.288000e+01 1.274950e+03
Price Change 340.0 4.078194e+00 1.200634e+01 -4.712969e+01 -9.394838e-01 4.819505e+00 1.069549e+01 5.505168e+01
Volatility 340.0 1.525976e+00 5.917984e-01 7.331632e-01 1.134878e+00 1.385593e+00 1.695549e+00 4.580042e+00
ROE 340.0 3.959706e+01 9.654754e+01 1.000000e+00 9.750000e+00 1.500000e+01 2.700000e+01 9.170000e+02
Cash Ratio 340.0 7.002353e+01 9.042133e+01 0.000000e+00 1.800000e+01 4.700000e+01 9.900000e+01 9.580000e+02
Net Cash Flow 340.0 5.553762e+07 1.946365e+09 -1.120800e+10 -1.939065e+08 2.098000e+06 1.698108e+08 2.076400e+10
Net Income 340.0 1.494385e+09 3.940150e+09 -2.352800e+10 3.523012e+08 7.073360e+08 1.899000e+09 2.444200e+10
Earnings Per Share 340.0 2.776662e+00 6.587779e+00 -6.120000e+01 1.557500e+00 2.895000e+00 4.620000e+00 5.009000e+01
Estimated Shares Outstanding 340.0 5.770283e+08 8.458496e+08 2.767216e+07 1.588482e+08 3.096751e+08 5.731175e+08 6.159292e+09
P/E Ratio 340.0 3.261256e+01 4.434873e+01 2.935451e+00 1.504465e+01 2.081988e+01 3.176476e+01 5.280391e+02
P/B Ratio 340.0 -1.718249e+00 1.396691e+01 -7.611908e+01 -4.352056e+00 -1.067170e+00 3.917066e+00 1.290646e+02
In [ ]:
# provide statistical summary of all categorical (object-type) columns
df.describe(include='object').T
Out[ ]:
count unique top freq
Ticker Symbol 340 340 AAL 1
Security 340 340 3M Company 1
GICS Sector 340 11 Industrials 53
GICS Sub Industry 340 104 Oil & Gas Exploration & Production 16

Observations:

  • There are 11 unique GICS Sectors and 104 unique GICS Sub Industries, suggesting a wide representation of the market.
  • Significant range and variability in Current Price, Price Change, and Volatility.
  • ROE and Cash Ratio have a wide range of values.
  • Net Cash Flow, Net Income, P/E Ratio, and P/B Ratio have values significantly higher than their 75th percentiles, suggesting outliers.
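One way to make the outlier claim above quantitative is the standard 1.5×IQR rule. A minimal sketch on a toy series (the real data is loaded from Drive, so synthetic values stand in here):

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lower) | (s > upper)).sum())

# Toy example: mostly small values plus two extremes
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100, -90])
print(iqr_outlier_count(toy))  # → 2
```

Applying the same function to each numerical column would give a per-variable outlier count.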
In [ ]:
# Convert all object type columns to category
df = df.apply(lambda col: col.astype("category") if col.dtype == "object" else col)
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   Ticker Symbol                 340 non-null    category
 1   Security                      340 non-null    category
 2   GICS Sector                   340 non-null    category
 3   GICS Sub Industry             340 non-null    category
 4   Current Price                 340 non-null    float64 
 5   Price Change                  340 non-null    float64 
 6   Volatility                    340 non-null    float64 
 7   ROE                           340 non-null    int64   
 8   Cash Ratio                    340 non-null    int64   
 9   Net Cash Flow                 340 non-null    int64   
 10  Net Income                    340 non-null    int64   
 11  Earnings Per Share            340 non-null    float64 
 12  Estimated Shares Outstanding  340 non-null    float64 
 13  P/E Ratio                     340 non-null    float64 
 14  P/B Ratio                     340 non-null    float64 
dtypes: category(4), float64(7), int64(4)
memory usage: 58.1 KB

Exploratory Data Analysis (EDA)¶

In [ ]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=12)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="viridis",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [ ]:
# function to plot a boxplot and a histogram along the same scale

def histogram_boxplot(data, feature, figsize=(16, 6), kde=False, bins=None, hue=None):
    """
    Combines boxplot and histogram

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (16,6))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True,
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter",
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

Observation on GICS Sector

In [ ]:
#create labeled barplot of stocks by sector
labeled_barplot(df, 'GICS Sector')
In [ ]:
#display the five sectors with the most number of stocks
df["GICS Sector"].value_counts(normalize = True).sort_values(ascending = False).head(n=5)
Out[ ]:
proportion
GICS Sector
Industrials 0.155882
Financials 0.144118
Consumer Discretionary 0.117647
Health Care 0.117647
Information Technology 0.097059

  • The stocks are drawn from 11 different industrial sectors, with no one sector comprising more than 16% of the dataset
  • The top 4 of the 11 sectors (industrials, financials, consumer discretionary, and health care) comprise over half of the total number of stocks

GICS Sub Industry

In [ ]:
#create labeled barplot of stocks by sub industry
labeled_barplot(df, 'GICS Sub Industry')
In [ ]:
#display the five sub industries with the most number of stocks
df['GICS Sub Industry'].value_counts().head(n=5)
Out[ ]:
count
GICS Sub Industry
Oil & Gas Exploration & Production 16
REITs 14
Industrial Conglomerates 14
Internet Software & Services 12
Electric Utilities 12

  • The dataset comprises stocks from 104 different sub-industries, with no sub-industry contributing more than 16 stocks
  • These observations indicate that the 340 stocks in the dataset are highly diversified across sectors and sub-industries

Numerical Columns

In [ ]:
#create list of columns with numerical variables
num_col = df.select_dtypes(include=np.number).columns.tolist()

#display histograms and boxplots for all numerical columns
for col in num_col:
    histogram_boxplot(df, col)

Current price

  • The distribution is heavily right skewed, with 49 of the 340 stocks priced at more than twice the median value of all stocks
  • As expected, no stock is listed at less than 0 dollars

Price change

  • The distribution is concentrated around small price changes, but long tails exist for both positive and negative changes
  • The most volatile stocks show as much as a 47% decrease to as much as a 55% increase over 13 weeks

Volatility

  • As expected, the distribution of standard deviations is right skewed and not normal

Cash Ratio / ROE

  • As expected, both distributions are heavily right skewed, and no stock is listed with a value of less than 0 for either metric
  • For example, 24 stocks are listed with returns on equity of less than 5 percent, and 25 stocks are listed with returns of over 100 percent

Net Income / EPS

  • As expected, net income is right skewed with both long positive and negative tails, i.e., most companies generate meager profits, but some are failing and some are highly successful
  • 32 companies within the dataset are showing a net income of less than 0 dollars
  • EPS, as a derivative of Net Income, shows a similar distribution, with most showing low positive values and a few stocks (34) showing negative values

Estimated shares outstanding

  • The distribution is highly right skewed, but no stock has a value of outstanding shares that is unrealistic

P/E and P/B Ratio

  • The distribution of P/E ratios is highly right skewed. Interestingly, no stock shows a negative ratio, even though several stocks have a negative EPS and no stock has a listed price of less than 0
  • The distribution of P/B ratios is mostly centered around 0 but has long positive and negative tails. For example, 175 of the 340 total stocks fall below the 25th percentile or above the 75th percentile, and 31 of the stocks are outliers

Conclusion

  • As expected, stocks offer uncertain returns with high upsides, mostly modest returns, and the omnipresent possibility that the value of the stock may become worthless (i.e., the company goes bankrupt)
  • All of these variables contain a few or several outliers; however, none of these values appear to be unrealistic given the nature of stock prices and historical expectations

EDA Questions:¶

1. What does the distribution of stock prices look like?

In [ ]:
# Histogram
plt.figure(figsize=(10, 4))
sns.histplot(df['Current Price'], kde=True)
plt.title('Histogram of Stock Prices')
plt.xlabel('Current Price ($)')
plt.ylabel('Frequency')
plt.show()

# Boxplot
plt.figure(figsize=(10, 4))
sns.boxplot(x=df['Current Price'])
plt.title('Boxplot of Stock Prices')
plt.xlabel('Current Price ($)')
plt.show()

Observations:

The distribution of stock prices is positively skewed with most of the stock prices concentrated at the lower end of the price range and a long tail stretching towards higher prices, indicating that there are a few stocks with very high prices compared to the majority.
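The skew observation can be made quantitative with `pandas.Series.skew`. A self-contained sketch on synthetic right-skewed data (an exponential sample stands in for the price column, since the real data is loaded from Drive):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Toy right-skewed "prices": exponential samples, same size as the dataset
prices = pd.Series(rng.exponential(scale=80.0, size=340))

skew = prices.skew()
print(round(skew, 2))  # a clearly positive value => right (positive) skew
```

On the actual data, `df['Current Price'].skew()` would confirm the positive skew seen in the histogram.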

2. The stocks of which economic sector has seen the maximum price increase on average?

In [ ]:
# Group by 'GICS Sector' and calculate the mean 'Price Change'
avg_price_increase_by_sector = df.groupby('GICS Sector')['Price Change'].mean().reset_index()

# Sort the results
avg_price_increase_by_sector_sorted = avg_price_increase_by_sector.sort_values('Price Change', ascending=False)

# Bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Price Change', y='GICS Sector', data=avg_price_increase_by_sector_sorted, palette='Blues_r')
plt.title('Average Price Increase by GICS Sector')
plt.xlabel('Average Price Increase (%)')
plt.ylabel('GICS Sector')
plt.show()

Observations:

The bar chart indicates that on average, the Information Technology sector has experienced the highest price increase among all GICS sectors, while the Energy sector has seen a price decrease. Other sectors like Health Care, Consumer Discretionary, and Materials also show notable average price increases.

3. How are the different variables correlated with each other?

In [ ]:
#create correlation heat map for numerical variables
plt.figure(figsize=(14, 7))
sns.heatmap(
    df[num_col].corr(),
    annot=True,
    vmin=-1,
    vmax=1,
    fmt=".2f",
    cmap='Blues'
)
plt.show()

Observations:

  • Current Price and Earnings Per Share (EPS) have a moderate positive correlation (0.48), suggesting that stocks with higher prices may also have higher earnings per share.
  • There is a notable positive correlation between Net Income and Earnings Per Share (0.56), as well as between Net Income and Estimated Shares Outstanding (0.59), indicating that companies with higher net income tend to have higher earnings per share and may have a greater number of shares outstanding.
  • Volatility has a moderate negative correlation with Price Change (-0.41), which could mean that stocks with higher volatility might see less price increase on average. Return on Equity (ROE) shows little to no correlation with most other metrics, indicating its independence as a performance measure.
  • The P/E Ratio (Price to Earnings) shows low correlation with other variables, suggesting it is not strongly influenced by the measured financial metrics in this dataset.
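To pull the strongest pairs out of a correlation matrix programmatically (rather than reading them off the heatmap), one can mask the upper triangle, stack, and sort. A sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy frame with one obviously correlated pair (b is roughly 2*a)
df_toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = df_toy.corr().abs()
# Keep the upper triangle only (k=1 excludes the diagonal), then sort pairs
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)
print(pairs.head())  # ('a', 'b') comes out on top
```

Replacing `df_toy` with `df[num_col]` would list the most correlated financial metrics directly.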

4. Cash ratio measures a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?

In [ ]:
# Calculate the mean 'Cash Ratio' for each 'GICS Sector'
avg_cash_ratio_by_sector = df.groupby('GICS Sector')['Cash Ratio'].mean().reset_index()

# Bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Cash Ratio', y='GICS Sector', data=avg_cash_ratio_by_sector, palette='YlOrRd_r')
plt.title('Average Cash Ratio by GICS Sector')
plt.xlabel('Average Cash Ratio')
plt.ylabel('GICS Sector')
plt.show()

Observations:

The Information Technology and Telecommunications Services sectors have the highest average cash ratios, suggesting stronger short-term financial stability.

5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?

In [ ]:
# Calculate the mean 'P/E Ratio' for each 'GICS Sector'
avg_pe_ratio_by_sector = df.groupby('GICS Sector')['P/E Ratio'].mean().reset_index()

# Bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='P/E Ratio', y='GICS Sector', data=avg_pe_ratio_by_sector, palette='Oranges_r')
plt.title('Average P/E Ratio by GICS Sector')
plt.xlabel('Average P/E Ratio')
plt.ylabel('GICS Sector')
plt.show()

Observations:

Longer bars indicate higher average P/E ratios; sectors such as Energy are, on average, priced higher relative to their earnings.

Functions for Numerical and Categorical Feature Visualisation

In [ ]:
def hist_and_boxplot(data, variable, figsize=(12, 4), kde=False, bins=None):
    """
    Creates a plot with both a histogram and boxplot for a specified numerical variable.

    Args:
    - data: The DataFrame containing the data.
    - variable: The name of the numerical variable (feature) to be plotted.
    - figsize: A tuple representing the size of the figure.
    - kde: A boolean indicating whether to overlay a density curve on the histogram.
    - bins: An integer representing the number of bins for the histogram, or None for automatic bin size.

    Returns:
    None
    """
    # Set up the matplotlib figure with two rows and one column
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=figsize, sharex=True, gridspec_kw={'height_ratios': [0.2, 0.8]})

    # Plot the boxplot on the first row
    sns.boxplot(x=variable, data=data, ax=ax1, showmeans=True, color="lightblue")
    ax1.set(xlabel='', title=f'Boxplot and Distribution of {variable}')

    # Plot the histogram on the second row
    if bins:
        sns.histplot(data[variable], kde=kde, bins=bins, ax=ax2, color="lightblue")
    else:
        sns.histplot(data[variable], kde=kde, ax=ax2, color="lightblue")

    # Draw lines for mean and median
    mean_val = data[variable].mean()
    median_val = data[variable].median()
    ax2.axvline(mean_val, color='green', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2f}')
    ax2.axvline(median_val, color='black', linestyle='-', linewidth=2, label=f'Median: {median_val:.2f}')

    # Add legend to the histogram
    ax2.legend()

    plt.show()
In [ ]:
def custom_barplot(data, feature, perc=False, top_n=None):
    """
    Creates a bar plot with labels for each bar from the dataframe data.

    Args:
    - data: The DataFrame containing the data.
    - feature: The column name to plot.
    - perc: A boolean to indicate whether to display the percentage (True) or count (False).
    - top_n: An integer to display only the top N categories, or None to display all.

    Returns:
    None
    """
    total_entries = len(data[feature])  # The total number of entries
    category_counts = data[feature].value_counts()

    # Determine the figure size based on the number of categories
    fig_width = top_n + 2 if top_n is not None else category_counts.size + 2
    plt.figure(figsize=(fig_width, 6))

    # Create the bar plot
    category_order = category_counts.index[:top_n] if top_n is not None else category_counts.index
    barplot = sns.countplot(x=feature, data=data, order=category_order, palette='Blues_r')

    # Add labels to each bar
    for bar in barplot.patches:
        label_x_pos = bar.get_x() + bar.get_width() / 2
        if perc:
            label = f"{100 * bar.get_height() / total_entries:.1f}%"
        else:
            label = f"{bar.get_height()}"
        barplot.annotate(
            label,
            (label_x_pos, bar.get_height()),
            ha='center',
            va='bottom',
            fontsize=11,
            textcoords='offset points',
            xytext=(0, 0)
        )

    # Rotate x-axis labels if there are many categories
    if category_counts.size > 10:
        plt.xticks(rotation=45, ha='right')

    # Adding a title
    plt.title(f'Distribution of {feature}', fontsize=15)

    plt.show()

Numerical Data

In [ ]:
# Get a list of the numerical variables
num_var = data.select_dtypes(include=['int64','float64']).columns
print(num_var)
Index(['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
       'Net Cash Flow', 'Net Income', 'Earnings Per Share',
       'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio'],
      dtype='object')
In [ ]:
# Plot all the numerical variables in the data DataFrame
numerical_variables = [
    'Current Price',
    'Price Change',
    'Volatility',
    'ROE',
    'Cash Ratio',
    'Net Cash Flow',
    'Net Income',
    'Earnings Per Share',
    'Estimated Shares Outstanding',
    'P/E Ratio',
    'P/B Ratio'
    ]
# Loop through the list, creating a histogram and boxplot for each variable
for variable in numerical_variables:
   hist_and_boxplot(data, variable)

Observations:

  • Current Price: A right-skewed distribution indicating that most stocks are priced lower, with fewer stocks having very high prices.
  • Price Change: A distribution including both positive and negative changes with long tails suggests significant variation in stock price movements within the time period analyzed.
  • Volatility: Right skewness in volatility implies that while most stocks have lower volatility, there are a few with exceptionally high volatility.
  • ROE: A right-skewed distribution here indicates that most stocks have lower returns on equity, with outliers potentially representing very profitable or very unprofitable firms.
  • Cash Ratio: A higher cash ratio generally suggests that a company has more liquid assets compared to its liabilities, which might be interpreted as a sign of good short-term financial health.
  • Net Cash Flow: A distribution that is centered around zero with outliers might indicate that while most companies manage to balance their cash flow, some have significant excess or shortfall.
  • Net Income: Right skewness in net income suggests most companies earn a modest income, with some outliers indicating substantial profits or losses.
  • Earnings Per Share: A distribution similar to net income, as it is derived from it, indicates a majority of stocks with modest earnings per share and some with particularly high or low values.
  • P/E Ratio: A right-skewed P/E ratio distribution indicates most stocks have a lower price compared to their earnings, with some having exceptionally high ratios which could suggest overvaluation.
  • P/B Ratio: A distribution centered around a median with outliers on both ends suggests a varied market perception of the value of companies' assets

Categorical Data

In [ ]:
# Create a barplot for 'GICS Sector'
custom_barplot(data=df, feature='GICS Sector', perc=True, top_n=None)

Observations:

Industrials and Financials sectors lead in distribution, while Telecommunications Services is the least represented in the dataset.

EDA Insights¶



The EDA reveals a diverse and dynamic stock market landscape. Key takeaways include:

  • Sector concentration: Industrials and Financials are dominant.
  • Price distribution: Skewed with some high-priced stocks.
  • Volatility: Present, with varying degrees across stocks and sectors.
  • Correlations: Some variables are moderately correlated, suggesting relationships to explore.
  • Financial health: Cash ratios and P/E ratios offer insights into company performance and valuation.
  • Outliers: Require further investigation to understand their impact. These findings provide a foundation for deeper analysis and informed investment decisions.

Based on the EDA, here are some recommendations:

  • Further investigate outliers: Identify and analyze the outliers present in variables like Current Price, ROE, and P/B Ratio. This can help in understanding if they are data errors or represent unique stock characteristics.
  • Deeper dive into sectors: Analyze specific sectors like Information Technology and Energy, which show high average price increase and P/E ratio, respectively. This can help identify potential investment opportunities.
  • Explore correlations: Further analyze the correlations between variables like Current Price and EPS, and Volatility and Price Change. Consider feature engineering to create new variables that might explain the relationships better.
  • Consider external factors: Investigate the impact of external factors like economic conditions and news events on stock prices and other financial indicators.
  • Cluster analysis: Use clustering techniques to group similar stocks based on their financial indicators. This can help in creating diversified investment portfolios.
  • Predictive modeling: Develop predictive models to forecast stock prices or other relevant metrics based on historical data and identified patterns. These recommendations can guide further analysis and help in making informed investment decisions.


Data Preprocessing¶

In [ ]:
# Drop Ticker Symbol since it is a unique identifier (same as 'Security')
df.drop(['Ticker Symbol'], axis=1, inplace=True)
In [ ]:
df.columns
Out[ ]:
Index(['Security', 'GICS Sector', 'GICS Sub Industry', 'Current Price',
       'Price Change', 'Volatility', 'ROE', 'Cash Ratio', 'Net Cash Flow',
       'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
       'P/E Ratio', 'P/B Ratio'],
      dtype='object')
In [ ]:
# Drop categorical variables before scaling
df_num = df.select_dtypes(include=[np.number])

Data Scaling¶

In [ ]:
# Apply the z score to each column
df_scaled=df_num.apply(zscore)
In [ ]:
# Check the scaled dataframe
df_scaled.head()
Out[ ]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 -0.393341 0.493950 0.272749 0.989601 -0.210698 -0.339355 1.554415 1.309399 0.107863 -0.652487 -0.506653
1 -0.220837 0.355439 1.137045 0.937737 0.077269 -0.002335 0.927628 0.056755 1.250274 -0.311769 -0.504205
2 -0.367195 0.602479 -0.427007 -0.192905 -0.033488 0.454058 0.744371 0.024831 1.098021 -0.391502 0.094941
3 0.133567 0.825696 -0.284802 -0.317379 1.218059 -0.152497 -0.219816 -0.230563 -0.091622 0.947148 0.424333
4 -0.260874 -0.492636 0.296470 -0.265515 2.237018 0.133564 -0.202703 -0.374982 1.978399 3.293307 0.199196
In [ ]:
sns.pairplot(df_scaled, height=2,aspect=2 , diag_kind='kde')
plt.show()

Observation:

  • The diagonal KDE of the P/E Ratio appears to have two peaks, but they are barely noticeable.
  • There are no distinct groupings in the scatterplots and no clear correlations between variables.
  • The outliers present warrant further investigation.
  • This doesn't help much in determining the number of clusters.

Principal Component Analysis (PCA)¶

In [ ]:
# Calculate the covariance matrix (captures the variance and covariance between features)
covMatrix = np.cov(df_scaled,rowvar=False)
print(covMatrix)
[[ 1.00294985e+00  1.35379678e-01 -1.24623162e-01 -5.51076941e-04
   1.28193172e-01 -2.20255018e-02  3.67332609e-02  4.81018703e-01
  -1.45735035e-01  2.61040375e-01  1.25001740e-01]
 [ 1.35379678e-01  1.00294985e+00 -4.09485757e-01 -4.34381685e-02
   1.69082964e-01  2.68245310e-02  1.79826528e-01  1.75918225e-01
   3.37550438e-02 -2.49801709e-03  3.44304592e-02]
 [-1.24623162e-01 -4.09485757e-01  1.00294985e+00  1.63011319e-01
   2.06659452e-02 -1.42137739e-02 -3.84563951e-01 -3.80427812e-01
  -9.59867699e-02  2.64377469e-01  4.66653027e-02]
 [-5.51076941e-04 -4.34381685e-02  1.63011319e-01  1.00294985e+00
  -6.93260015e-02 -5.23589005e-02 -2.89375193e-01 -4.06486472e-01
  -3.03773642e-02  2.34095558e-02 -6.34048877e-02]
 [ 1.28193172e-01  1.69082964e-01  2.06659452e-02 -6.93260015e-02
   1.00294985e+00  1.14154326e-01  2.86730699e-02  2.48317919e-02
   1.36269927e-01  8.97470272e-02  2.30349368e-01]
 [-2.20255018e-02  2.68245310e-02 -1.42137739e-02 -5.23589005e-02
   1.14154326e-01  1.00294985e+00  4.43591948e-02  1.94047846e-02
  -5.15439469e-02  2.69553828e-02  5.72075893e-02]
 [ 3.67332609e-02  1.79826528e-01 -3.84563951e-01 -2.89375193e-01
   2.86730699e-02  4.43591948e-02  1.00294985e+00  5.59397718e-01
   5.91048859e-01 -2.24683029e-01 -6.17250714e-02]
 [ 4.81018703e-01  1.75918225e-01 -3.80427812e-01 -4.06486472e-01
   2.48317919e-02  1.94047846e-02  5.59397718e-01  1.00294985e+00
  -2.69850312e-02 -2.61428450e-01  1.49589029e-02]
 [-1.45735035e-01  3.37550438e-02 -9.59867699e-02 -3.03773642e-02
   1.36269927e-01 -5.15439469e-02  5.91048859e-01 -2.69850312e-02
   1.00294985e+00 -1.45943101e-02 -2.82382975e-02]
 [ 2.61040375e-01 -2.49801709e-03  2.64377469e-01  2.34095558e-02
   8.97470272e-02  2.69553828e-02 -2.24683029e-01 -2.61428450e-01
  -1.45943101e-02  1.00294985e+00  3.65292109e-02]
 [ 1.25001740e-01  3.44304592e-02  4.66653027e-02 -6.34048877e-02
   2.30349368e-01  5.72075893e-02 -6.17250714e-02  1.49589029e-02
  -2.82382975e-02  3.65292109e-02  1.00294985e+00]]
In [ ]:
# Initialize PCA with 10 components and fit model to scaled data
pca = PCA(n_components=10)
pca.fit(df_scaled)
Out[ ]:
PCA(n_components=10)
In [ ]:
# PCA results
# Print the eigenVALUES associated with each principal component, which indicate the variance captured by each component
print(pca.explained_variance_)
[2.51212401 1.58411327 1.33434149 1.11310406 1.09308721 0.95662194
 0.82731447 0.71754029 0.4845015  0.2833637 ]
In [ ]:
# Print the eigenVECTORS (principal components), which define the directions of maximum variance in the data
print(pca.components_)
[[ 0.18335823  0.27666842 -0.42919618 -0.30106508  0.08417114  0.03368814
   0.51091003  0.5033668   0.22120756 -0.20815091  0.00991673]
 [ 0.5774047   0.18815749  0.04101083 -0.06258117  0.32773726  0.1280034
  -0.25155831  0.18433241 -0.36327549  0.37736302  0.360796  ]
 [-0.17770732  0.04646452  0.18283759  0.00473711  0.53919175  0.1855024
   0.20918754 -0.27531285  0.58166569  0.24934832  0.2959799 ]
 [-0.00900284  0.65971773 -0.38664453  0.53431652  0.0838151  -0.13655589
  -0.12101272 -0.26011429  0.02333621 -0.00303257 -0.14551133]
 [-0.39287414  0.12202474 -0.14571667 -0.12446023  0.1588855   0.56275703
  -0.17221048 -0.06902206 -0.32807051 -0.47228787  0.29493986]
 [ 0.04539895  0.07374983 -0.01301984 -0.04924358 -0.1433002   0.69439441
   0.07660902 -0.02703535  0.00607268  0.39241333 -0.571456  ]
 [ 0.34597383 -0.32918961  0.06740711  0.69551664 -0.04384053  0.31995867
   0.15598116  0.16995766  0.13220581 -0.30317218  0.12940702]
 [ 0.03785302 -0.10485332  0.17805858  0.00977663  0.69685159 -0.13119433
  -0.09410062  0.13610016 -0.12727392 -0.29100372 -0.56982313]
 [-0.05374498  0.54680565  0.75429579  0.04388306 -0.18199001  0.01575891
   0.16200469  0.20518699 -0.03660507 -0.15469563  0.01551126]
 [ 0.47507303  0.10833734  0.05038953 -0.32004423 -0.14366391  0.09802191
  -0.43017307 -0.28501031  0.45036866 -0.39076807 -0.07444627]]
In [ ]:
# Print the proportion of variance explained by each component
print(pca.explained_variance_ratio_)
[0.22770322 0.14358674 0.120947   0.10089366 0.09907929 0.08670985
 0.0749892  0.06503908 0.04391604 0.02568457]
In [ ]:
# Create a bar chart showing the proportion of variance explained by each principal component
plt.bar(list(range(1, 11)), pca.explained_variance_ratio_, alpha=0.5, align='center')
plt.title("Individual Explained Variance")
plt.ylabel('Variance explained')
plt.xlabel('Principal component')
plt.show()
In [ ]:
# Create a step plot showing the cumulative proportion of variance explained by the components
plt.step(list(range(1, 11)), np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.title("Cumulative Explained Variance")
plt.ylabel('Cumulative variance explained')
plt.xlabel('Principal component')
plt.show()

Observations:

This helps show how many components are needed to explain most of the variance, and it appears that nearly all of them are: each component adds meaningful information, so all ten are required to capture most of the variance in the dataset.

Note: reducing the dimensionality of the data is advisable for supervised learning, where there is a target variable. For unsupervised learning, outliers are the data points that most affect the clustering and could be excluded from the dataset. Here we keep them, but they would be the first candidates for treatment, as opposed to reducing the number of variables altogether.
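The "how many components" question can also be answered numerically by finding the smallest number of components whose cumulative explained-variance ratio crosses a threshold. A minimal sketch using the (rounded) ratios printed by `pca.explained_variance_ratio_` above:

```python
import numpy as np

# Rounded explained-variance ratios from pca.explained_variance_ratio_ above
ratios = np.array([0.2277, 0.1436, 0.1209, 0.1009, 0.0991,
                   0.0867, 0.0750, 0.0650, 0.0439, 0.0257])

cum = np.cumsum(ratios)
# Smallest k whose cumulative explained variance reaches 90%
k = int(np.searchsorted(cum, 0.90) + 1)
print(k, cum[k - 1])  # 8 components are needed to reach 90%
```

Needing 8 of 10 components for 90% of the variance supports the observation that the variance is spread across nearly all components.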

K-means Clustering¶

In [ ]:
#print average distortions for range of kmeans models fitted to scaled dataset
clusters = range(1, 11)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(df_scaled)
    prediction = model.predict(df_scaled)
    distortion = (
        sum(
            np.min(cdist(df_scaled, model.cluster_centers_, "euclidean"), axis=1)
        )
        / df_scaled.shape[0]
    )

    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.4359794966724415
Number of Clusters: 3 	Average Distortion: 2.3666102174035797
Number of Clusters: 4 	Average Distortion: 2.2150707100717115
Number of Clusters: 5 	Average Distortion: 2.148999515589566
Number of Clusters: 6 	Average Distortion: 2.0986142664921537
Number of Clusters: 7 	Average Distortion: 2.037523546710418
Number of Clusters: 8 	Average Distortion: 1.993329796710798
Number of Clusters: 9 	Average Distortion: 1.9053944614353846
Number of Clusters: 10 	Average Distortion: 1.8673989349526698
In [ ]:
#fit KMeans model and use visualizer to indicate optimal K value
model = KMeans(random_state=42)
visualizer = KElbowVisualizer(model, k=(1, 11), timings=True)
visualizer.fit(df_scaled)  # fit the data to the visualizer
visualizer.show()  # finalize and render figure
plt.show()
In [ ]:
#fit KMeans model and print silhouette scores for a range of k clusters
sil_score = []
cluster_list = range(2, 11)
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=42)
    preds = clusterer.fit_predict(df_scaled)
    score = silhouette_score(df_scaled, preds)
    sil_score.append(score)
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))

#show scores in line graph
plt.plot(cluster_list, sil_score)
plt.show()
For n_clusters = 2, the silhouette score is 0.4484662124740991
For n_clusters = 3, the silhouette score is 0.4419662311474576
For n_clusters = 4, the silhouette score is 0.45434371948348606
For n_clusters = 5, the silhouette score is 0.4670141851729817
For n_clusters = 6, the silhouette score is 0.16609442972204497
For n_clusters = 7, the silhouette score is 0.16010466610937854
For n_clusters = 8, the silhouette score is 0.3811731688385819
For n_clusters = 9, the silhouette score is 0.16110839502412727
For n_clusters = 10, the silhouette score is 0.15641849202440433
In [ ]:
#fit KMeans model and use visualizer to indicate optimal K value
model = KMeans(random_state=42)
visualizer = KElbowVisualizer(model, k=(2, 11), metric="silhouette", timings=True)
visualizer.fit(df_scaled)  # fit the data to the visualizer
visualizer.show()  # finalize and render figure
plt.show()
In [ ]:
#find optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=42))
visualizer.fit(df_scaled)
visualizer.show()
plt.show()

Between the Elbow and Silhouette plots, the number of clusters with the best performance appears to be 5.

In [ ]:
# Initialize the KMeans with k=4
model4 = KMeans(n_clusters=4, random_state=1)

# Initialize the SilhouetteVisualizer with the KMeans model
visualizer = SilhouetteVisualizer(model4, colors='yellowbrick')

# Fit the visualizer to your scaled data
visualizer.fit(df_scaled)

# Display the silhouette plot
visualizer.show()
Out[ ]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

Although the k=4 model also produces distinct, well-separated groupings, the highest average silhouette score is observed for n_clusters = 5. So, investigate k=5.

Cluster Profiling¶

k-Means=5 Model

In [ ]:
# Initialize KMeans with 5 clusters
kmeans5_model = KMeans(n_clusters=5, random_state=42)

# Fit KMeans model
kmeans5_model.fit(df_scaled)

# Model predicts the cluster for each data point in the scaled data
prediction=kmeans5_model.predict(df_scaled)
In [ ]:
# Assign cluster labels to DataFrames

# Add a new column to the original dataframe and assigns the cluster labels to it
df["KMeans_Cluster"] = prediction

df_scaled["KMeans_Cluster"] = prediction # do it to the scaled dataframe as well
In [ ]:
print("Groups Assigned : \n")
df.head()
Groups Assigned : 

Out[ ]:
Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio KMeans_Cluster
0 American Airlines Group Industrials Airlines 42.349998 9.999995 1.687151 135 51 -604000000 7610000000 11.39 6.681299e+08 3.718174 -8.784219 2
1 AbbVie Health Care Pharmaceuticals 59.240002 8.339433 2.197887 130 77 51000000 5144000000 3.15 1.633016e+09 18.806350 -8.750068 2
2 Abbott Laboratories Health Care Health Care Equipment 44.910000 11.301121 1.273646 21 67 938000000 4423000000 2.94 1.504422e+09 15.275510 -0.394171 2
3 Adobe Systems Inc Information Technology Application Software 93.940002 13.977195 1.357679 9 180 -240840000 629551000 1.26 4.996437e+08 74.555557 4.199651 2
4 Analog Devices, Inc. Information Technology Semiconductors 55.320000 -1.827858 1.701169 14 272 315120000 696878000 0.31 2.247994e+09 178.451613 1.059810 4
In [ ]:
# Create boxplots for each feature within each cluster
df_scaled.boxplot(by='KMeans_Cluster', layout = (4,3),figsize=(30,30))
plt.show()

Observations:

  • Almost all clusters have their medians close to zero, indicating consistency within these clusters.

  • Cash Ratio: Cluster 4 has a slightly higher median cash ratio. Outliers are present in clusters 0 and 4, suggesting a few companies with high cash ratios.

  • Current Price: Cluster 4 has a higher median price and more variability. Outliers are present in most clusters, particularly clusters 0 and 4, indicating some stocks with higher prices.

  • Earnings Per Share: Clusters 3 and 4 have a wider interquartile range, indicating more variability within those clusters.

  • Estimated Shares Outstanding: Cluster 4 shows a wider interquartile range. Outliers are present in cluster 0, indicating some companies with a high number of shares outstanding.

  • Net Cash Flow: Clusters 0 and 1 show wider distributions with the median around zero. Clusters 2, 3, and 4 have tighter distributions. Outliers in several clusters indicate companies with either high or low net cash flow.

  • Net Income: Cluster 0 has a wider distribution, and cluster 4 shows some high-income outliers.

  • P/B Ratio: The price-to-book ratio across all clusters is centered around zero, with cluster 4 having the widest interquartile range and outliers, indicating more variability in the P/B ratio within this cluster.

  • P/E Ratio: The price-to-earnings ratio shows some outliers, indicating a few companies with unusually high or low P/E ratios.

  • Price Change: All clusters have medians around zero with similar stock price changes.

  • ROE: The return on equity is centered around zero for all clusters.

  • Volatility: Cluster 3 shows a higher median volatility and a wider spread, indicating this cluster contains stocks that are more volatile.

In [ ]:
# Generate a color map with a unique color for each bar
num_of_bars = len(df_scaled.columns) - 1  # Subtract 1 for the GROUP column
cm = plt.get_cmap('BuPu_r')
colors = [cm(1. * i / num_of_bars) for i in range(num_of_bars)]

# Plot the bar chart with unique colors
grouped_means = df_scaled.groupby("KMeans_Cluster").mean()
grouped_means.plot.bar(figsize=(15, 8), color=colors)
plt.title("Mean Feature Values by KMeans Cluster")
plt.show()

Observations:

  • Negative bars indicate that the mean of that feature in a given cluster is below the overall (zero-centered) mean.
  • Since the data naturally includes negative values (e.g., financial data with profits and losses), negative cluster means simply reflect the nature of the data in that cluster.

Observations over K Means¶

Visualization of the mean values of each feature in the scaled dataset, grouped by KMeans cluster.

Cluster 2: low variability within this group suggests that its data points are quite similar to each other across all the features.

Clusters 0 and 3: show high variability, each capturing a wider range of characteristics within its members.

Clusters 1 and 4: show mostly positive values. Current Price stands out in Cluster 4, making it a defining feature of this group.

Some features, such as P/E Ratio and P/B Ratio, seem to follow similar trends.

Hierarchical Clustering¶

In [ ]:
# Create a new DataFrame without the 'KMeans_Cluster' labels
df_scaled_hclust = df_scaled.drop(columns=['KMeans_Cluster'], errors='ignore')
In [ ]:
df_scaled_hclust.head()
Out[ ]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 -0.393341 0.493950 0.272749 0.989601 -0.210698 -0.339355 1.554415 1.309399 0.107863 -0.652487 -0.506653
1 -0.220837 0.355439 1.137045 0.937737 0.077269 -0.002335 0.927628 0.056755 1.250274 -0.311769 -0.504205
2 -0.367195 0.602479 -0.427007 -0.192905 -0.033488 0.454058 0.744371 0.024831 1.098021 -0.391502 0.094941
3 0.133567 0.825696 -0.284802 -0.317379 1.218059 -0.152497 -0.219816 -0.230563 -0.091622 0.947148 0.424333
4 -0.260874 -0.492636 0.296470 -0.265515 2.237018 0.133564 -0.202703 -0.374982 1.978399 3.293307 0.199196

Exploring Different Distance Metrics and Linkage Methods¶

Calculate the Cophenetic Correlation Coefficient (CCC)

  • This step uses the linkage matrix Z and the original distance matrix (from the scaled data) to compute the CCC.
  • Ranges from -1 to 1.
  • Measures how faithfully the hierarchical clustering preserves the original pairwise distances. The higher, the better.
In [ ]:
# Step 1: Exploring Different Distance Metrics and Linkage Methods

distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
linkage_methods = ["single", "complete", "average", "weighted"] # "centroid" and "ward" require Euclidean distance

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(df_scaled_hclust, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(df_scaled_hclust))
        print(f"Cophenetic correlation for {dm} distance and {lm} linkage is {c}.")
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm = [dm, lm]

print(f"\n Highest CCC is: {high_cophenet_corr}, obtained with {high_dm_lm[0]} distance and {high_dm_lm[1]} linkage.")
Cophenetic correlation for euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for euclidean distance and weighted linkage is 0.8693784298129404.
Cophenetic correlation for chebyshev distance and single linkage is 0.9062538164750717.
Cophenetic correlation for chebyshev distance and complete linkage is 0.598891419111242.
Cophenetic correlation for chebyshev distance and average linkage is 0.9338265528030499.
Cophenetic correlation for chebyshev distance and weighted linkage is 0.9127355892367.
Cophenetic correlation for mahalanobis distance and single linkage is 0.925919553052459.
Cophenetic correlation for mahalanobis distance and complete linkage is 0.7925307202850002.
Cophenetic correlation for mahalanobis distance and average linkage is 0.9247324030159736.
Cophenetic correlation for mahalanobis distance and weighted linkage is 0.8708317490180428.
Cophenetic correlation for cityblock distance and single linkage is 0.9334186366528574.
Cophenetic correlation for cityblock distance and complete linkage is 0.7375328863205818.
Cophenetic correlation for cityblock distance and average linkage is 0.9302145048594667.
Cophenetic correlation for cityblock distance and weighted linkage is 0.731045513520281.

 Highest CCC is: 0.9422540609560814, obtained with euclidean distance and average linkage.
In [ ]:
# Step 2: Focusing on Euclidean Distance with the linkage methods including "centroid" and "ward"
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

for lm in linkage_methods:
    Z = linkage(df_scaled_hclust, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(df_scaled_hclust))
    print(f"Cophenetic correlation for Euclidean distance and {lm} linkage is {c}.")
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for Euclidean distance and centroid linkage is 0.9314012446828154.
Cophenetic correlation for Euclidean distance and ward linkage is 0.7101180299865353.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404.
In [ ]:
# Step 3: Plotting Dendrograms for Different Linkage Methods
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

for i, method in enumerate(linkage_methods):
    Z = linkage(df_scaled_hclust, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(df_scaled_hclust))
    axs[i].annotate(f"Cophenetic\nCorrelation\n{coph_corr:0.2f}", (0.80, 0.80), xycoords="axes fraction")

plt.tight_layout()
plt.show()

Observations:

  • The cophenetic correlation is highest for Average Linkage followed by Centroid Linkage methods.
  • 6 appears to be the appropriate number of clusters from the dendrogram for Average Linkage.
In [ ]:
# Use the linkage matrix from the 'average' method for dendrogram and cluster labels
Z = linkage(df_scaled_hclust, metric='euclidean', method='average')

# Plot the dendrogram
plt.figure(figsize=(12, 8))
dendrogram(Z)

# Find the distance at which the 6th-from-last merge occurs
# This is where the dendrogram is cut to yield 6 clusters
threshold_distance = Z[-6, 2]  # The 6th last merge

plt.title('Hierarchical Clustering Dendrogram (Average Linkage)')
plt.xlabel('Sample index')
plt.ylabel('Distance')

# Add a horizontal line to show where we cut the dendrogram to get 6 clusters
plt.axhline(y=threshold_distance, color='r', linestyle='--')
plt.text(x=plt.gca().get_xlim()[1], y=threshold_distance, s=' 6 clusters cut-off', verticalalignment='center', color='r')

plt.show()
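As an alternative to fitting a separate agglomerative model, flat cluster labels can be read directly off a linkage matrix with scipy's `fcluster`. A minimal sketch on toy two-blob data (not the project dataset):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in for the scaled feature matrix: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 3)), rng.normal(5, 0.3, (20, 3))])

Z = linkage(X, metric="euclidean", method="average")
# 'maxclust' cuts the tree so that exactly 2 flat clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(sorted(set(labels)))  # → [1, 2]
```

On the project data, `fcluster(Z, t=6, criterion="maxclust")` would cut the average-linkage tree at the same point as the dashed line in the dendrogram above.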

Cluster Profiling¶

In [ ]:
# Create the Agglomerative Clustering model with 6 clusters
hclust6_model = AgglomerativeClustering(n_clusters=6, metric='euclidean',  linkage='average')

# Fit the model on scaled data
hclust6_model.fit(df_scaled_hclust)
Out[ ]:
AgglomerativeClustering(linkage='average', n_clusters=6)
In [ ]:
# Retrieve the cluster labels from the fitted model and add to DataFrames
# Add the labels to the scaled data
df_scaled_hclust['H_Cluster'] = hclust6_model.labels_  #AgglomerativeClustering does not have a .predict() method

# Assign the labels to original dataframe for further analysis
df['H_Cluster'] = hclust6_model.labels_
In [ ]:
print("Groups Assigned : \n")
df.head()
Groups Assigned : 

Out[ ]:
Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio KMeans_Cluster H_Cluster
0 American Airlines Group Industrials Airlines 42.349998 9.999995 1.687151 135 51 -604000000 7610000000 11.39 6.681299e+08 3.718174 -8.784219 2 0
1 AbbVie Health Care Pharmaceuticals 59.240002 8.339433 2.197887 130 77 51000000 5144000000 3.15 1.633016e+09 18.806350 -8.750068 2 0
2 Abbott Laboratories Health Care Health Care Equipment 44.910000 11.301121 1.273646 21 67 938000000 4423000000 2.94 1.504422e+09 15.275510 -0.394171 2 0
3 Adobe Systems Inc Information Technology Application Software 93.940002 13.977195 1.357679 9 180 -240840000 629551000 1.26 4.996437e+08 74.555557 4.199651 2 0
4 Analog Devices, Inc. Information Technology Semiconductors 55.320000 -1.827858 1.701169 14 272 315120000 696878000 0.31 2.247994e+09 178.451613 1.059810 4 0
In [ ]:
# Create boxplots for each feature within each cluster created with H Clust method
df_scaled_hclust.boxplot(by='H_Cluster', layout = (5,3),figsize=(20,30))
plt.show()

Observations:

  • Cluster 1 has many outliers across all features.

  • Cash Ratio: Most clusters have similar median cash ratios close to 0, with cluster 4 showing a higher median value. Cluster 1 displays outliers with higher cash ratios.

  • Current Price: The median current price across clusters is close to 0, with clusters 1 having outliers on the higher side.

  • Earnings Per Share (EPS): Cluster 3 has lower median EPS values compared to other clusters and shows a wide range of EPS, indicating diverse companies within the cluster.

  • Estimated Shares Outstanding: Most clusters appear to have similar medians around 0 for estimated shares outstanding. Cluster 2 has a notably wider box at higher median than the rest. Cluster 1 has outliers with high shares.

  • Net Cash Flow: Cluster 2 has a median much higher than 0 for net cash flow. Cluster 0 has outliers with both high and low net cash flow values.

  • Net Income: The median net income is around 0 for most clusters, with cluster 2 slightly higher and cluster 3 lower. Cluster 1 shows significant outliers indicating both high and low net income.

  • Price-to-Book (P/B) Ratio: All clusters have a median P/B ratio around 0 except cluster 5, which has a higher ratio. Cluster 1 shows outliers indicating higher P/B ratios.

  • Price-to-Earnings (P/E) Ratio: The median P/E ratios are centered around 0 across clusters, with cluster 1 having outliers with high P/E ratios.

  • Price Change: Consistent across clusters with median values around 0, except cluster 2, which is lower. Cluster 1 has outliers indicating both large and small price changes.

  • Return on Equity (ROE): Cluster 3 has a notably higher median ROE than the other clusters, whose medians sit around 0, and shows a greater range, suggesting more variable ROE within the cluster. Cluster 1 also shows outliers with high ROE ratios.

  • Volatility: Cluster 3 has a higher median volatility. Cluster 1 shows greater variance and outliers, indicating it contains stocks with a wider range of volatility.

In [ ]:
# Generate a color map with a unique color for each bar
num_of_bars = len(df_scaled_hclust.columns) - 1  # Subtract 1 for the GROUP column
cm = plt.get_cmap('BuPu_r')
colors = [cm(1. * i / num_of_bars) for i in range(num_of_bars)]

# Plot the bar chart with unique colors
grouped_means = df_scaled_hclust.groupby("H_Cluster").mean()
grouped_means.plot.bar(figsize=(15, 8), color=colors)
plt.title("Hierarchical Clustering")
plt.show()

Observations over Hierarchical Clustering¶

Cluster 0 is characterized by minimal variation across all financial metrics, suggesting a grouping of stocks with relatively stable and average financial characteristics.

Cluster 2 stands out with significant negative values in 'Net Income' and 'Earnings Per Share', indicating a cluster of stocks that may be underperforming in profitability.

Cluster 4 is notable for a high positive mean 'Current Price', possibly representing stocks with higher market valuations or stocks that have seen substantial price appreciation.

K-means vs Hierarchical Clustering¶

Execution Time : K-means and hierarchical clustering both executed rapidly, demonstrating that for this dataset, computational efficiency is high with either method.

Distinct Clusters : Both algorithms identified a major cluster with a significant portion of stocks, indicating a commonality in their characteristics. The smaller clusters suggest potential niche segments within the market.

Overlap in Clusters : Cross-tabulation revealed significant overlap of 280 data points between the two methods, indicating some agreement in cluster assignments.

Recommended Clusters : K-Means suggested 5 clusters, while Hierarchical recommended 6, pointing to a slight difference in clustering structure.

Sector Analysis : K-Means uniquely clustered the Energy sector, which Hierarchical did not replicate, showing method-dependent differences in sector-based clustering.
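The cross-tabulation referred to above can be reproduced with `pandas.crosstab`. A minimal sketch on hypothetical label columns (cluster numbering between the two methods is arbitrary, so agreement shows up as a concentration of counts in a few cells, not necessarily on the diagonal):

```python
import pandas as pd

# Hypothetical stand-ins for df["KMeans_Cluster"] and df["H_Cluster"]
labels = pd.DataFrame({
    "KMeans_Cluster": [0, 0, 1, 1, 2, 2, 2],
    "H_Cluster":      [0, 0, 1, 2, 2, 2, 2],
})

# Rows: K-means clusters; columns: hierarchical clusters
ct = pd.crosstab(labels["KMeans_Cluster"], labels["H_Cluster"])
print(ct)  # in this toy example, 6 of 7 points fall in matching cells
```

On the project data, the same call over the two cluster columns of `df` yields the table behind the 280-point overlap figure.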

Actionable Insights and Recommendations¶

Client-Centric Portfolios : Trade&Ahead should align the investment recommendations with the client's specific financial goals and risk preferences, possibly using the identified clusters as portfolio frameworks.

Sector-Specific Investments : The unique clustering of the Energy sector in K-Means indicates that it may follow different market dynamics compared to other sectors. Investors could look at this as an opportunity for sector-specific investment strategies or for diversification purposes.

Algorithm Refinement : The difference in cluster suggestions from K-Means and Hierarchical clustering methods indicates a need for algorithm refinement. Combining insights from both methods could lead to a more robust clustering solution.

Product Development : Segmentation might reflect varying technological advancement levels within subindustries. Firms can focus on product development initiatives that cater to the specific trends and customer demands within their segment.